knitr document van Steensel lab

Introduction

I previously processed the raw sequencing data, optimized the barcode clustering, quantified the pDNA data and normalized the cDNA data. In this script, I want to have a detailed look at the cDNA data from a general perspective.

Description of Data

How to make a good rendering table:

column1   column2   column3
1         2         3
a         b         c

Data processing

Path, Libraries, Parameters and Useful Functions

knitr::opts_chunk$set(echo = TRUE)
StartTime <- Sys.time()

# 8-digit date tag (YYYYMMDD):
Date <- format(Sys.time(), "%Y%m%d")
# libraries:
library(RColorBrewer)
library(ggplot2)
library(dplyr)
library(maditr)
library(tibble)
library(pheatmap)
library(ggpubr)
library(ggbeeswarm)
library(ggforce)
library(viridis)
library(plyr)  # attached after dplyr, so plyr masks dplyr verbs such as summarise() and mutate()
library(cowplot)
library(gridExtra)
library(GGally)
library(readr)
library(stringr)
library(tidyr)

Custom functions

Functions used throughout this script.

Data import

Analysis

First insights into data distribution - reporter activity distribution plots

## `geom_smooth()` using formula 'y ~ x'

Heat map - display mean log2-activity for each TF in each condition
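
The code for this heatmap is not shown in the rendered report. As a minimal sketch of the idea, the example below averages log2 activity per TF and condition on toy data and passes the resulting matrix to `pheatmap`. The data frame `cDNA_toy`, its column names and the TF and condition labels are hypothetical, and averaging after the log2 transform is an assumption.

```r
# Toy data; all names and values here are hypothetical placeholders.
set.seed(1)
cDNA_toy <- data.frame(
  tf        = rep(c("TF_A", "TF_B", "TF_C"), each = 6),
  condition = rep(rep(c("cond1", "cond2"), each = 3), times = 3),
  activity  = rlnorm(18, meanlog = 1, sdlog = 0.5),
  stringsAsFactors = FALSE
)

# Assumption: the mean is taken after the log2 transform.
cDNA_toy$log2_activity <- log2(cDNA_toy$activity)

# Rows = TFs, columns = conditions, cells = mean log2 activity.
mean_mat <- with(cDNA_toy, tapply(log2_activity, list(tf, condition), mean))

# Draw the heatmap only if pheatmap is installed.
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(mean_mat, cluster_cols = FALSE)
}
```

Keeping the conditions unclustered (`cluster_cols = FALSE`) preserves their experimental order, while TFs are still grouped by similarity.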

Heatmap for native enhancers

Run FIMO script again

# motfn=/home/f.comoglio/mydata/Annotations/TFDB/Curated_Natoli/update_2017/20170320_pwms_selected.meme
# odir=/home/m.trauernicht/mydata/projects/tf_activity_reporter/data/SuRE_TF_1/results/native-enhancer/fimo
# query=/home/m.trauernicht/mydata/projects/tf_activity_reporter/data/SuRE_TF_1/results/native-enhancer/cDNA_df_native.fasta

# nice -n 19 fimo --no-qvalue --thresh 1e-4 --verbosity 1 --o $odir $motfn $query 

Load FIMO results

We built a TF motif matrix using -log10 transformed FIMO scores. We used this feature encoding throughout the rest of this analysis, unless otherwise stated.
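
A reporter-by-motif matrix of this kind could be assembled roughly as follows. This is only a sketch on toy data: the data frame `fimo_toy` and its column names are hypothetical, the transformed quantity is assumed to be the FIMO hit p-value, and keeping the strongest hit per reporter and motif is likewise an assumption.

```r
# Toy stand-in for FIMO hits (hypothetical names and values).
fimo_toy <- data.frame(
  motif_id      = c("TF_A", "TF_A", "TF_B", "TF_C"),
  sequence_name = c("rep_1", "rep_1", "rep_1", "rep_2"),
  p_value       = c(1e-5, 1e-7, 1e-6, 1e-4),
  stringsAsFactors = FALSE
)

# Assumption: the feature is -log10 of the hit p-value.
fimo_toy$neg_log10_p <- -log10(fimo_toy$p_value)

# Rows = reporters, columns = motifs; keep the strongest hit per pair.
motif_mat <- with(
  fimo_toy,
  tapply(neg_log10_p, list(sequence_name, motif_id), max)
)

# Reporter/motif pairs without any hit are encoded as 0.
motif_mat[is.na(motif_mat)] <- 0
motif_mat
```

Encoding missing hits as 0 keeps the matrix complete, so it can be used directly as a model design matrix.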

Visualize FIMO results

Look at only expressed TFs in mESCs

Filter expressed TFs

Use FIMO matrix to build log-linear model

Binary presence of motif to explain expression variance
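
The model itself is not shown in the rendered output. One plausible form, sketched here on simulated data, regresses log2 activity on 0/1 motif indicators with `lm()` and reads the explained variance from R-squared. The variable names, effect sizes and number of reporters are all hypothetical.

```r
set.seed(42)
n <- 200  # hypothetical number of reporters

# Binary motif-presence features (1 = motif present in the reporter).
motif_A <- rbinom(n, 1, 0.5)
motif_B <- rbinom(n, 1, 0.5)

# Simulated log2 activity: motif_A has an effect, motif_B does not.
log2_activity <- 1 + 2 * motif_A + rnorm(n, sd = 0.5)

# Linear model on the log scale with binary predictors.
fit <- lm(log2_activity ~ motif_A + motif_B)

# Fraction of the variance in log2 activity explained by motif presence.
r2 <- summary(fit)$r.squared
coef(fit)
r2
```

Because the response is on the log2 scale, each coefficient can be read as the log2 fold change in activity associated with the presence of that motif.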

Heatmap per TF - comparing design activities mutated vs. non-mutated

Heatmap per TF - only WT TF activities

Compute activity changes relative to their negative controls
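
As an illustration of this step, the sketch below subtracts the mean log2 activity of each TF's negative-control reporters from the log2 activity of its other reporters. The data frame `reporters`, the `type` labels and the values are hypothetical, and it is an assumption that negative controls are matched per TF.

```r
# Toy reporter table (hypothetical names and values).
reporters <- data.frame(
  tf       = c("TF_A", "TF_A", "TF_A", "TF_B", "TF_B", "TF_B"),
  type     = c("wt", "wt", "neg_ctrl", "wt", "wt", "neg_ctrl"),
  activity = c(8, 16, 2, 3, 6, 3),
  stringsAsFactors = FALSE
)

# Mean log2 activity of the negative controls, one value per TF.
neg <- reporters[reporters$type == "neg_ctrl", ]
neg_log2 <- tapply(log2(neg$activity), neg$tf, mean)

# log2 fold change of each remaining reporter over its TF's negative control.
wt <- reporters[reporters$type == "wt", ]
wt$log2_fc <- log2(wt$activity) - as.numeric(neg_log2[as.character(wt$tf)])
wt
```

Looking up the control by name via `as.character(wt$tf)` avoids indexing by factor codes, which could silently pick the wrong control.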

## pdf 
##   3

Taken together, these heatmaps indicate that we have informative reporters for ~10 TFs, and that the TF reporter design matters for some but not all TFs.

SuperPlot of TF activity per condition - this way we can plot not only the mean, but the complete data distribution across technical and biological replicates
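
A SuperPlot of this kind can be sketched as follows: every individual measurement is drawn as a small point coloured by biological replicate, and the per-replicate means are overlaid as larger symbols. The toy data, column names and replicate structure are hypothetical, and `geom_jitter()` stands in for whichever point layout the original figure uses.

```r
set.seed(7)
# Toy data: 2 conditions x 3 biological replicates x 20 measurements each.
toy <- expand.grid(
  obs       = 1:20,
  bio_rep   = c("rep1", "rep2", "rep3"),
  condition = c("cond1", "cond2"),
  stringsAsFactors = FALSE
)
toy$log2_activity <- rnorm(
  nrow(toy),
  mean = ifelse(toy$condition == "cond2", 2, 0)
)

# One mean per condition and biological replicate.
rep_means <- aggregate(log2_activity ~ condition + bio_rep,
                       data = toy, FUN = mean)

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = condition, y = log2_activity, colour = bio_rep)) +
    geom_jitter(width = 0.2, alpha = 0.4, size = 1) +   # all measurements
    geom_point(data = rep_means, size = 4, shape = 18) + # replicate means
    theme_bw()
  print(p)
}
```

Showing the replicate means on top of the raw points makes it visible whether a difference between conditions is consistent across biological replicates or driven by a single one.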

SuperPlots comparing different designs

Log-linear expression modelling to explain variance - model for each TF

## `geom_smooth()` using formula 'y ~ x'

Log-linear expression modelling to explain variance - model for each TF - only WT - without condition

## `geom_smooth()` using formula 'y ~ x'

Can expression variance be explained by the TF properties?

Finding the best reporters for a single TF

Make the same models as before - but now per TF and per condition

## `geom_smooth()` using formula 'y ~ x'

Exporting potential data.

Session Info

paste("Run time: ",format(Sys.time()-StartTime))
## [1] "Run time:  5.0146 mins"
getwd()
## [1] "/DATA/usr/m.trauernicht/projects/SuRE-TF"
date()
## [1] "Thu Sep 24 14:17:48 2020"
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.7 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] tidyr_1.0.0        stringr_1.4.0      readr_1.3.1        GGally_1.5.0      
##  [5] gridExtra_2.3      cowplot_1.0.0      plyr_1.8.6         viridis_0.5.1     
##  [9] viridisLite_0.3.0  ggforce_0.3.1      ggbeeswarm_0.6.0   ggpubr_0.2.5      
## [13] magrittr_1.5       pheatmap_1.0.12    tibble_3.0.1       maditr_0.6.3      
## [17] dplyr_0.8.5        ggplot2_3.3.0      RColorBrewer_1.1-2
## 
## loaded via a namespace (and not attached):
##  [1] beeswarm_0.2.3    tidyselect_0.2.5  xfun_0.12         purrr_0.3.3      
##  [5] lattice_0.20-38   splines_3.6.3     colorspace_1.4-1  vctrs_0.2.4      
##  [9] htmltools_0.4.0   mgcv_1.8-31       yaml_2.2.0        rlang_0.4.5      
## [13] pillar_1.4.3      glue_1.3.1        withr_2.1.2       tweenr_1.0.1     
## [17] lifecycle_0.2.0   munsell_0.5.0     ggsignif_0.6.0    gtable_0.3.0     
## [21] evaluate_0.14     labeling_0.3      knitr_1.28        vipor_0.4.5      
## [25] Rcpp_1.0.3        scales_1.1.0      farver_2.0.1      hms_0.5.3        
## [29] digest_0.6.23     stringi_1.4.6     polyclip_1.10-0   grid_3.6.3       
## [33] tools_3.6.3       crayon_1.3.4      pkgconfig_2.0.3   Matrix_1.2-18    
## [37] ellipsis_0.3.0    MASS_7.3-51.5     data.table_1.12.8 assertthat_0.2.1 
## [41] rmarkdown_2.0     reshape_0.8.8     R6_2.4.1          nlme_3.1-143     
## [45] compiler_3.6.3